Abstract. Learning behavior of simple perceptrons is analyzed for a teacher-student
scenario in which output labels are provided by a teacher network for a set of possibly 
correlated input patterns, and such that teacher and student networks are of the same 
type. Our main concern is the effect of statistical correlations among the input patterns 
on learning performance. For this purpose, we extend to the teacher-student scenario a 
methodology for analyzing randomly labeled patterns recently developed in J. Phys. A:
Math. Theor. 41, 324013 (2008). This methodology is used for analyzing situations in
which orthogonality of the input patterns is enhanced in order to optimize the learning 
performance. 
1. Introduction

Learning from examples is a fundamental technique for analyzing real-world data, and 
simple perceptrons are included in widely used devices for this purpose. In the last two 
decades, the structural similarity between statistical learning and statistical mechanics 
of disordered systems has been commonly recognized [1]. This similarity has promoted
statistical mechanical analysis of perceptron learning [2, 3, 4]. Such research has led
to the discovery of various learning behaviors of perceptrons and the development of
computationally feasible approximate algorithms that had not previously been known
in conventional learning research [5, 6, 7, 8, 9, 10, 11, 12, 13].

Numerous studies have been published on perceptron learning. However, there 
still remain several research directions to explore. Learning from correlated patterns 
is a typical example. As a first step in this direction, the authors recently developed 
methodologies to analyze learning from randomly labeled patterns that are correlated 
in a certain manner on the basis of a formula involving rectangular random matrices 
[14, 15]. This paper is concerned with a second step; more precisely, we extend the
methodologies developed for randomly labeled patterns to cases of a teacher-student 
scenario in which output labels are provided by a teacher network, and both teacher 
and student networks are of the same type. 

In earlier studies, asymptotic behavior of learning curves and a critical pattern ratio 
of perfect learning in which the teacher network is completely identified from a reference 
data set of the same order as the network size, have been assessed for continuous and 
discrete weights, respectively, for the case of independently and identically distributed 
(i.i.d.) patterns [6, 7]. Therefore, our main concern herein is how these assessments are
influenced by correlations among input patterns. Recent deeper understanding of the
relations among learning, communication and information theories has suggested that
a perceptron can be a useful building block for various coding schemes [16, 17, 18]. The
analysis herein may also be a useful guideline for developing efficient schemes to be used 
in information and communication engineering. 

This paper is organized as follows. In the next section, we define the model that 
we shall investigate. In section 3, the main section of this manuscript, we will extend a
scheme to handle correlated patterns, utilizing a formula involving rectangular random
matrices, which was developed in [14, 15], to perceptron learning of a teacher-student
scenario. In section 4, the extended scheme will be applied to several examples. The
final section is devoted to a summary and future work. 

2. Model definition 

For an $N$-dimensional input pattern vector $x$, a single layer perceptron of weight $w$ of
dimension $N$ returns a binary output $y \in \{+1, -1\}$ given by

$$y = \mathrm{sgn}\left( N^{-1/2} x^{\mathrm{T}} w \right), \tag{1}$$

where T denotes the matrix transpose and the prefactor $1/\sqrt{N}$ is introduced to
keep relevant variables $O(1)$ as $N \to \infty$. Let us suppose a situation in which a
student perceptron infers the weight vector of a teacher perceptron, $w_0$, based on a
given reference data set $\xi^p = \{(x_1, y_1), (x_2, y_2), \ldots, (x_p, y_p)\}$, where output labels are
provided by the teacher as $y_\mu = \mathrm{sgn}(N^{-1/2} x_\mu^{\mathrm{T}} w_0)$ $(\mu = 1, 2, \ldots, p)$. The problem we
consider here is how the inference accuracy depends on the pattern ratio $\alpha = p/N$
and correlations in the pattern matrix $X = N^{-1/2} (x_1, x_2, \ldots, x_p)^{\mathrm{T}}$ as $N$ and $p$ tend to
infinity, while keeping $\alpha = p/N$ finite.
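As a minimal sketch of this setup, the following Python fragment draws a reference data set labeled by a teacher. The sizes, the random seed, the Gaussian patterns, and the helper name `teacher_labels` are assumptions made for illustration only; they are not taken from the paper.

```python
import numpy as np

def teacher_labels(patterns, w0):
    """Labels y_mu = sgn(N^{-1/2} x_mu^T w0), one per row x_mu of `patterns`."""
    n = w0.shape[0]
    fields = patterns @ w0 / np.sqrt(n)
    # A field of exactly zero is mapped to +1 here (a convention chosen for this sketch).
    return np.where(fields >= 0.0, 1, -1)

rng = np.random.default_rng(0)          # seed chosen arbitrarily
N, alpha = 200, 1.5                     # illustrative sizes, not taken from the paper
p = int(alpha * N)
w0 = rng.standard_normal(N)
w0 *= np.sqrt(N) / np.linalg.norm(w0)   # normalise so that |w0|^2 = N
x = rng.standard_normal((p, N))         # rows are the input patterns x_mu
X = x / np.sqrt(N)                      # pattern matrix X = N^{-1/2} (x_1, ..., x_p)^T
y = teacher_labels(x, w0)
```

With this normalisation the fields `X @ w0` stay of order one as the sizes grow, which is the role of the prefactor in the model definition.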

As the basis for our analysis, we introduce a representation of the singular value
decomposition

$$X = U D V^{\mathrm{T}}, \tag{2}$$

where $D = \mathrm{diag}(d_k)$ is a $p \times N$ diagonal matrix consisting of singular values $d_k$
$(k = 1, 2, \ldots, \min(p, N))$, and $U$, $V$ denote $p \times p$ and $N \times N$ orthogonal matrices,
respectively. Linear algebra guarantees that any $p \times N$ matrix can be decomposed
according to equation (2). The singular values $d_k$ are linked to the eigenvalues $\lambda_k$
$(k = 1, 2, \ldots, N)$ of the correlation matrix $X^{\mathrm{T}} X$ via $\lambda_k = d_k^2$ $(k = 1, 2, \ldots, \min(p, N))$
and $\lambda_k = 0$ otherwise, where $\min(p, N)$ denotes the lesser value of $p$ and $N$. Orthogonal
matrices $U$ and $V$ constitute the left and right eigen-bases of $X$, respectively; i.e., they
are the eigen-bases of $X X^{\mathrm{T}}$ and $X^{\mathrm{T}} X$.
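The link between singular values and the eigenvalues of the correlation matrix can be checked numerically. The sketch below uses NumPy's `numpy.linalg.svd` and `numpy.linalg.eigvalsh`; the matrix sizes and the Gaussian test matrix are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
p, N = 6, 10                                     # small illustrative sizes
X = rng.standard_normal((p, N)) / np.sqrt(N)

U, d, Vt = np.linalg.svd(X, full_matrices=True)  # U: p x p, Vt: N x N, d: min(p, N) values
m = min(p, N)
D = np.zeros((p, N))
D[np.arange(m), np.arange(m)] = d                # p x N "diagonal" matrix of singular values

# Eigenvalues of X^T X in descending order, to match the ordering of d.
lam = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1]
lam_from_svd = np.concatenate([d**2, np.zeros(N - m)])
```

Here `lam` and `lam_from_svd` agree up to rounding: the first min(p, N) eigenvalues equal the squared singular values and the remaining ones vanish.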

To handle the correlations in X somewhat analytically, we assume hereinafter that 
the following two properties hold for the pattern matrix X: 

(i) The eigenvalue spectrum of the correlation matrix $X^{\mathrm{T}} X$, $\rho_{X^{\mathrm{T}} X}(\lambda) =
N^{-1} \sum_{k=1}^{N} \delta(\lambda - \lambda_k)$, tends to a certain specific distribution $\rho(\lambda)$ in the limit as
$N \to \infty$ for typical samples of $X$. Controlling $\rho(\lambda)$ allows us to characterize various
second-order correlations in $X$.

(ii) $U$ and $V$ are independently generated from the uniform distributions of $p \times p$ and
$N \times N$ orthogonal matrices (the Haar measures), respectively. This assumption
makes it possible to characterize the correlations in $X$ using only the eigenvalue
spectrum $\rho(\lambda)$.
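The two assumptions above suggest a direct way to sample pattern matrices with a prescribed spectrum. The following is a sketch under stated assumptions: Haar matrices are drawn through a QR decomposition of a Gaussian matrix with a sign correction (a standard recipe, not something specified in the paper), and the function names and the target spectrum are hypothetical.

```python
import numpy as np

def haar_orthogonal(n, rng):
    """Draw an n x n orthogonal matrix from the Haar measure via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    # Multiplying each column by the sign of the matching diagonal entry of R
    # removes the sign ambiguity of QR, which is needed for the Haar law.
    return q * np.sign(np.diag(r))

def sample_pattern_matrix(eigenvalues, p, rng):
    """Return X = U D V^T (p x N) whose X^T X has the given N eigenvalues."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    N, m = lam.size, min(p, lam.size)
    if np.any(lam[m:] > 0):
        raise ValueError("at most min(p, N) eigenvalues can be non-zero")
    D = np.zeros((p, N))
    D[np.arange(m), np.arange(m)] = np.sqrt(lam[:m])
    return haar_orthogonal(p, rng) @ D @ haar_orthogonal(N, rng).T

rng = np.random.default_rng(2)
target = np.array([2.0, 1.0, 1.0, 0.5, 0.0, 0.0])   # hypothetical spectrum, N = 6
X = sample_pattern_matrix(target, p=4, rng=rng)
```

Because only the spectrum enters, any two matrices produced this way from the same `eigenvalues` are statistically equivalent in the sense of assumption (ii).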



3. Analytical scheme 

3.1. Expression for the average free energy 

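Given $\xi^p$, the volume of weight vectors that are compatible with $\xi^p$, which serves as the
partition function of the current system, can be expressed as

$$Z(\xi^p) = \sum_{w} P(w) \prod_{\mu=1}^{p} \Theta\left( y_\mu N^{-1/2} \sum_{k=1}^{N} x_{\mu k} w_k \right), \tag{3}$$

where $P(w)$ represents the prior distribution of $w$ and $\Theta(x) = 1$ for $x > 0$ and 0,
otherwise. The conventional scheme of statistical mechanics of disordered systems
leads us to evaluate the average free energy (per component)

$$\Phi = -\frac{1}{N} \left[ \log Z(\xi^p) \right]_{\xi^p}. \tag{4}$$

Here, $[\cdots]_{\xi^p}$ represents an average taken over the reference data set $\xi^p = \{X, y\}$ with
respect to a distribution

$$P(\xi^p) = P(X) \sum_{w_0} P(w_0) \prod_{\mu=1}^{p} \Theta\left( y_\mu N^{-1/2} \sum_{k=1}^{N} x_{\mu k} w_{0k} \right), \tag{5}$$

where $P(X)$ denotes the distribution of the pattern matrix $X = U D V^{\mathrm{T}}$. This implies
that equation (4) can be evaluated as

$$\Phi = -\frac{1}{N} \sum_{\xi^p} P(X) Z(\xi^p) \log Z(\xi^p). \tag{6}$$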



3.2. Replica analysis 

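Equation (6) can be evaluated by the replica method. For this, we first evaluate
$\sum_{\xi^p} P(X) Z^n(\xi^p)$ for $n = 1, 2, \ldots$ utilizing the expression

$$Z(\xi^p) = \sum_{w} P(w) \int \prod_{\mu=1}^{p} d\Delta_\mu \, \Theta(y_\mu \Delta_\mu) \, \delta\left( \Delta_\mu - N^{-1/2} \sum_{k=1}^{N} x_{\mu k} w_k \right)
= \sum_{w} \int du \prod_{k=1}^{N} P(w_k) \prod_{\mu=1}^{p} \hat{\Theta}(y_\mu u_\mu) \times e^{-\mathrm{i} u^{\mathrm{T}} X w}, \tag{7}$$

where $\mathrm{i} = \sqrt{-1}$, $\hat{\Theta}(x) = (2\pi)^{-1} \int dt \, \Theta(t) e^{\mathrm{i} t x}$ and $u = (u_1, u_2, \ldots, u_p)^{\mathrm{T}}$. We have assumed
a factorizable prior $P(w) = \prod_{k=1}^{N} P(w_k)$ for analytical tractability. Taking $n$-th powers,
for $n (= 1, 2, \ldots)$, equation (7) yields an expression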



exp 



a=l 



exp 



a=l 



(8) 



For evaluating the average of this equation with respect to $X$, it is useful to note that for
fixed sets of dynamical variables $\{u_a\} = \{u_1, u_2, \ldots, u_n\}$ and $\{w_a\} = \{w_1, w_2, \ldots, w_n\}$,
$\tilde{u}_a = U^{\mathrm{T}} u_a$ and $\tilde{w}_a = V^{\mathrm{T}} w_a$ behave as continuous random variables which satisfy the
strict constraints

$$\frac{1}{N} \tilde{w}_a \cdot \tilde{w}_b = \frac{1}{N} w_a \cdot w_b = q_w^{ab}, \tag{9}$$

$$\frac{1}{p} \tilde{u}_a \cdot \tilde{u}_b = \frac{1}{p} u_a \cdot u_b = q_u^{ab}, \tag{10}$$

$(a, b = 1, 2, \ldots, n)$ when $U$ and $V$ are independently sampled from the Haar measures.
This indicates that $Z^n(\xi^p)$ can be evaluated by the saddle point method with respect to
the macroscopic order parameters $(q_w^{ab})$ and $(q_u^{ab})$ in the limit as $N, p \to \infty$,
keeping $\alpha = p/N \sim O(1)$. Furthermore, due to the intrinsic permutation symmetry with
respect to the replica indices $a = 1, 2, \ldots, n$, it is natural to assume that the relevant
saddle point is replica symmetric (RS). This assumption can be expressed as

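$$q_w^{ab} = \begin{cases} \chi_w + q_w, & (a = b), \\ q_w, & (a \neq b), \end{cases} \qquad q_u^{ab} = \begin{cases} \chi_u - q_u, & (a = b), \\ -q_u, & (a \neq b), \end{cases} \tag{11}$$

which yields

$$\frac{1}{N} \log \left( \int \mathcal{D}X \, P(X) \, e^{-\mathrm{i} \sum_{a=1}^{n} u_a^{\mathrm{T}} X w_a} \right) = (n - 1) F(\chi_w, \chi_u) + F(\chi_w + n q_w, \chi_u - n q_u), \tag{12}$$

for fixed sets of $\{w_a\}$ and $\{u_a\}$, where

$$F(x, y) = \mathop{\mathrm{Extr}}_{\Lambda_x, \Lambda_y} \left\{ -\frac{\alpha - 1}{2} \log \Lambda_y - \frac{1}{2} \left\langle \log(\Lambda_x \Lambda_y + \lambda) \right\rangle_\rho + \frac{\Lambda_x x}{2} + \frac{\alpha \Lambda_y y}{2} \right\} - \frac{1}{2} \log x - \frac{\alpha}{2} \log y - \frac{1 + \alpha}{2} \tag{13}$$

and $\int \mathcal{D}X$ denotes integration with respect to the pattern matrix $X$ [14, 15, 19]. $\langle \cdots \rangle_\rho$
means an average of $(\cdots)$ with respect to $\rho(\lambda)$. $\mathrm{Extr}_x \{\cdots\}$ denotes the operation of
extremization with respect to $x$, which corresponds to the saddle point evaluation of a
certain complex integral and does not refer to maximization or minimization.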

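Equation (12) and assessment of the volumes of the dynamical variables yield
a saddle point evaluation of $Z^n(\xi^p)$ for $n = 1, 2, \ldots$. However, the functional form
of $Z^n(\xi^p)$ that we obtain can be defined for real values of $n$ as well. Therefore, we
analytically continue the expression from $n = 1, 2, \ldots$ to $n \in \mathbb{R}$ to evaluate equation (6).
For $n \to 1$, the normalization constraint $\sum_{\xi^p} P(X) Z(\xi^p) = \sum_{\xi^p} P(\xi^p) = 1$
implies that relations

$$\chi_w + q_w = T_w = \sum_{w} P(w) w^2, \qquad \chi_u - q_u = 0, \tag{14}$$

$$\frac{\partial F(\chi_w + q_w, \chi_u - q_u)}{\partial \chi_w} = 0, \qquad \frac{\partial F(\chi_w + q_w, \chi_u - q_u)}{\partial \chi_u} = -\frac{1}{2} \langle \lambda \rangle T_w, \tag{15}$$

must hold. These yield a formula for calculating the average free energy as

$$\Phi = -\mathop{\mathrm{Extr}}_{q_w, q_u} \left\{ A_{wu}(q_w, q_u) + A_w(q_w) + \alpha A_u(q_u) \right\}, \tag{16}$$

where

$$A_{wu}(q_w, q_u) = F(T_w - q_w, q_u) + \frac{1}{2} \langle \lambda \rangle T_w q_u, \tag{17}$$

$$A_w(q_w) = \mathop{\mathrm{Extr}}_{\hat{q}_w} \left\{ -\frac{\hat{q}_w q_w}{2} + \int Dz \, V(z; \hat{q}_w) \log V(z; \hat{q}_w) \right\}, \tag{18}$$

and

$$A_u(q_u) = \mathop{\mathrm{Extr}}_{\hat{q}_u} \left\{ -\frac{\hat{q}_u q_u}{2} + 2 \int Dz \, H(\gamma z) \log H(\gamma z) \right\}, \tag{19}$$

given a particular eigenvalue spectrum $\rho(\lambda)$. Here, $Dz = dz \, e^{-z^2/2} / \sqrt{2\pi}$ represents the
Gaussian measure, $H(u) = \int_u^{\infty} Dz$, $V(z; \hat{q}_w) = \sum_{w} P(w) \exp\left( -\hat{q}_w w^2 / 2 + \sqrt{\hat{q}_w} z w \right)$
and $\gamma = \sqrt{\hat{q}_u / (\langle \lambda \rangle T_w / \alpha - \hat{q}_u)}$. Equations (16) - (19) contain the main results of this
paper.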
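The origin of the term proportional to $\langle \lambda \rangle T_w q_u$ in equation (17) can be sketched as follows. This is a reconstruction under stated assumptions, not the authors' own derivation: $G(n)$ is a hypothetical shorthand for the right-hand side of equation (12), and the derivatives of $F$ at $y = 0$ are taken from the small-$y$ behavior $F(x, y) \simeq -\langle \lambda \rangle x y / 2$.

```latex
\begin{align*}
G(n) &= (n-1)\,F(\chi_w,\chi_u) + F(\chi_w + n q_w,\ \chi_u - n q_u), \\
\left.\frac{\partial G}{\partial n}\right|_{n=1}
  &= F(\chi_w,\chi_u)
   + q_w \left.\frac{\partial F}{\partial x}\right|_{(T_w,\,0)}
   - q_u \left.\frac{\partial F}{\partial y}\right|_{(T_w,\,0)}
   && \text{using } \chi_w + q_w = T_w,\ \chi_u - q_u = 0, \\
  &= F(T_w - q_w,\ q_u) + \frac{1}{2}\langle\lambda\rangle T_w\, q_u
   && \text{using } \partial_x F(T_w,0) = 0,\ \partial_y F(T_w,0) = -\tfrac{1}{2}\langle\lambda\rangle T_w .
\end{align*}
```

The last line coincides with $A_{wu}(q_w, q_u)$, since the normalization relations give $\chi_w = T_w - q_w$ and $\chi_u = q_u$.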

Two points are noteworthy here. Firstly, $q_w$ being determined by the saddle
point condition of equation (16) physically means that there is a typical overlap
$N^{-1} w_0^{\mathrm{T}} w$ between the teacher $w_0$ and student $w$ perceptrons after learning. Therefore
the learning performance can be assessed by solving the saddle point problem of
equation (16). In addition, equation (16) itself represents the mutual information (per
component) between $w$ and $y$ for a given typical pattern matrix $X$, which measures
the quantity of information about $w$ that can typically be gained from output labels
$y$ when $X$ is fixed, and so is useful for characterizing potential capabilities of simple
perceptrons when they are used for communication purposes. This is also linked
to the typical entropy (per component) $S$ of the posterior distribution $P(w | \xi^p) =
Z^{-1}(\xi^p) P(w) \prod_{\mu=1}^{p} \Theta\left( y_\mu N^{-1/2} \sum_{k=1}^{N} x_{\mu k} w_k \right)$ via

$$S = S_0 - \Phi, \tag{20}$$

where $S_0 = -N^{-1} \sum_{w} P(w) \log P(w)$ is the entropy (per component) of the prior
distribution $P(w)$. Secondly, the assumption that the teacher and student networks
are of the same type corresponds to the Nishimori condition known in spin glass
research, which implies that the RS solution constructed above is expected to be correct
[20, 21, 22]. Therefore, we do not proceed to the replica symmetry breaking (RSB)
analysis here. Treatment in a more general setting, including the local stability analysis
of the RS solution and the expression of the 1RSB free energy, can be found in references
[14, 15].



4. Examples 

4.1. Independently and identically distributed patterns

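In order to show consistency with existing results, we firstly apply the scheme that
we have developed here to the case of i.i.d. patterns, in which entries of the pattern
matrix $X$ are independently generated from an identical distribution with mean zero
and variance $N^{-1}$. In the current framework, this case is characterized by the
Marčenko-Pastur distribution

$$\rho(\lambda) = [1 - \alpha]^{+} \delta(\lambda) + \frac{\sqrt{[\lambda - \lambda_-]^{+} [\lambda_+ - \lambda]^{+}}}{2 \pi \lambda}, \tag{21}$$

where $\lambda_\pm = (1 \pm \sqrt{\alpha})^2$ and

$$[x]^{+} = \begin{cases} x, & (x > 0), \\ 0, & (\text{otherwise}). \end{cases} \tag{22}$$

Plugging equation (21) into equation (13) yields

$$F(x, y) = -\frac{\alpha}{2} x y. \tag{23}$$

Applying this to equation (17) and utilizing the relation $\langle \lambda \rangle = \alpha$ which holds for equation
(21) yields the result that $A_{wu}(q_w, q_u) = \alpha q_w q_u / 2$. This implies that $\hat{q}_u = q_w$ at the
extremum, where $\hat{q}_u$ is the auxiliary variable in equation (19). These conditions mean
that the average free energy can be calculated as

$$\Phi = -\mathop{\mathrm{Extr}}_{q_w, \hat{q}_w} \left\{ \int Dz \, V(z; \hat{q}_w) \log V(z; \hat{q}_w) - \frac{\hat{q}_w q_w}{2} + 2 \alpha \int Dz \, H\left( \sqrt{\tfrac{q_w}{T_w - q_w}} \, z \right) \log H\left( \sqrt{\tfrac{q_w}{T_w - q_w}} \, z \right) \right\}, \tag{24}$$

where $V(z; \hat{q}_w)$ is the function that appears in equation (18). This is equivalent to the
known expression for the free energy of the teacher-student
scenario with i.i.d. patterns [10].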
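The spectral properties used for i.i.d. patterns, namely a mean eigenvalue equal to the pattern ratio and a continuous part supported between $(1 - \sqrt{\alpha})^2$ and $(1 + \sqrt{\alpha})^2$, can be illustrated on a finite sample. The sketch below assumes Gaussian entries and arbitrary sizes; finite-size samples only approximate the limiting spectrum.

```python
import numpy as np

rng = np.random.default_rng(3)
N, alpha = 400, 2.0                            # illustrative sizes
p = int(alpha * N)
X = rng.standard_normal((p, N)) / np.sqrt(N)   # i.i.d. entries, mean 0, variance 1/N
lam = np.linalg.eigvalsh(X.T @ X)

lam_minus = (1.0 - np.sqrt(alpha)) ** 2
lam_plus = (1.0 + np.sqrt(alpha)) ** 2
mean_lam = lam.mean()
print(f"<lambda> = {mean_lam:.3f} (alpha = {alpha})")
print(f"spectrum in [{lam.min():.3f}, {lam.max():.3f}], "
      f"edges [{lam_minus:.3f}, {lam_plus:.3f}]")
```

The sample mean of the eigenvalues should be close to `alpha`, and the extreme eigenvalues should sit near the two edges, up to fluctuations that shrink as `N` grows.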

4.2. Asymptotic learning curve for spherical weights

The relation between a measure of learning performance and the amount of reference
data $p$ or the pattern ratio $\alpha$ is sometimes termed a learning curve. In statistical
learning theory, the asymptotic behavior of learning curves is frequently examined,
which is, however, limited mostly to the cases of i.i.d. patterns [23, 24, 25, 26]. We here
employ the methodology that has been developed for the analysis of the asymptotic
learning curve in order to investigate the effect of correlations in the pattern matrix.
Investigations of this kind may be useful for active learning or experimental design
contexts in which the pattern matrix can be designed to optimize learning performance
[27, 28, 29].

As a representative example, let us consider the case of spherical weights $P(w) \propto
\delta(|w|^2 - N)$, which implies that $T_w = 1$. For generality, we investigate a model for which
the second-order correlations of the pattern matrix $X$ are characterized by an eigenvalue
spectrum

$$\rho(\lambda) = (1 - \kappa) \delta(\lambda) + \kappa \tilde{\rho}(\lambda), \tag{25}$$

where $\tilde{\rho}(\lambda)$ is a distribution, the support of which is defined over a certain region of
$\lambda > 0$. $0 < \kappa \le 1$ is introduced to include the possibility of rank deficiency of the
pattern matrix.

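For this situation, we evaluate $q_w$, which serves as a performance measure
representing the overlap between teacher and student perceptrons, for $\alpha \gg 1$ solving
the saddle point problem of equation (16). For spherical weights, $\Lambda_w$, which is the
counterpart of $\Lambda_x$ in equation (13) for $x = T_w - q_w = 1 - q_w$, is always fixed to unity
at the saddle point. This yields four coupled equations relevant to the calculation of $q_w$
thus:

$$\hat{q}_u = \frac{\langle \lambda \rangle}{\alpha} + \frac{2}{\alpha} \frac{\partial F(1 - q_w, q_u)}{\partial q_u} = \frac{\langle \lambda \rangle}{\alpha} + \Lambda_u - \frac{1}{q_u}, \tag{26}$$

$$q_u = \left( 1 - \frac{\kappa}{\alpha} \right) \frac{1}{\Lambda_u} + \frac{\kappa}{\alpha} \left\langle \frac{1}{\Lambda_u + \lambda} \right\rangle_{\tilde{\rho}}, \tag{27}$$

$$q_u = \frac{\alpha}{\pi \langle \lambda \rangle \sqrt{1 - \alpha \hat{q}_u / \langle \lambda \rangle}} \int Dz \, \frac{\exp\left( -\frac{\alpha \hat{q}_u}{2 \langle \lambda \rangle} z^2 \right)}{H\left( \sqrt{\alpha \hat{q}_u / \langle \lambda \rangle} \, z \right)}, \tag{28}$$

$$1 - q_w = (1 - \kappa) + \kappa \left\langle \frac{\Lambda_u}{\Lambda_u + \lambda} \right\rangle_{\tilde{\rho}}, \tag{29}$$

where $\Lambda_u$ denotes the counterpart of $\Lambda_y$ in equation (13) for $y = q_u$ and
$\langle \cdots \rangle_{\tilde{\rho}}$ represents an average with respect to $\tilde{\rho}(\lambda)$. For $\alpha \gg 1$, equations (27) and
(28) yield asymptotic relations $q_u \simeq (1 - \kappa/\alpha) / \Lambda_u$ and $1 - \alpha \hat{q}_u / \langle \lambda \rangle \simeq (c \alpha / \pi \langle \lambda \rangle)^2 / q_u^2 \simeq
(c \alpha / \pi \langle \lambda \rangle)^2 \Lambda_u^2$, where $c = \int Dz \, e^{-z^2/2} / H(z) \simeq 2.263$. Inserting these relations into
equation (26) yields $\Lambda_u \simeq \kappa \langle \lambda \rangle (\pi / c)^2 / \alpha^2 \simeq 1.926 \kappa \langle \lambda \rangle / \alpha^2$. From this result and
equation (29), we obtain the asymptotic learning curve

$$q_w \simeq \kappa - \frac{1.926 \kappa^2 \langle \lambda \rangle \langle \lambda^{-1} \rangle_{\tilde{\rho}}}{\alpha^2} + O(\alpha^{-3}). \tag{30}$$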



Two issues are noteworthy here. Firstly, in the current model, $\kappa$ denotes a fraction
of the relevant dimensions that the pattern matrix $X$ spans. Convergence $q_w \to \kappa$ as
$\alpha \to \infty$ in equation (30) indicates that weights concerning the relevant dimensions are
correctly identified, while no information is obtained from the irrelevant dimensions for
perceptron learning. The rate of convergence scales as $O(\kappa^2 \alpha^{-2}) = O((\kappa N)^2 / p^2)$, which
indicates that the irrelevant dimensions do not affect the learning performance of the
relevant weights. This is in accordance with existing results for singular statistical
models in which some of the eigenvalues of the Fisher information matrix vanish,
similar to cases of equation (25) such that $0 < \kappa < 1$ [26]. Secondly, the inequality
$\langle \lambda^{-1} \rangle_{\tilde{\rho}} \ge \langle \lambda \rangle_{\tilde{\rho}}^{-1} = \kappa \langle \lambda \rangle^{-1}$, which holds because $\lambda$ is positive and $\langle \lambda \rangle = \kappa \langle \lambda \rangle_{\tilde{\rho}}$ is satisfied
by equation (25), implies that $q_w$ is asymptotically bounded above:

$$q_w \lesssim \kappa - \frac{1.926 \kappa^3}{\alpha^2} + O(\alpha^{-3}), \tag{31}$$

where equality holds when $\langle \lambda^{-1} \rangle_{\tilde{\rho}} = \langle \lambda \rangle_{\tilde{\rho}}^{-1}$ is satisfied. This property is asymptotically
satisfied for the i.i.d. patterns since equation (21) yields $\langle \lambda^{-1} \rangle = (\alpha - 1)^{-1} \simeq
\alpha^{-1} = \langle \lambda \rangle^{-1}$ for $\alpha \gg 1$. In addition, $\kappa = 1$ holds for $\alpha \ge 1$ of the i.i.d.
patterns, which maximizes the value of convergence $\kappa$ to unity, reproducing the known
asymptotic learning curve $\cos^{-1}(q_w) / \pi \simeq 0.625 / \alpha$ [7]. Therefore, the i.i.d. patterns
are asymptotically optimal for the leading order of the learning curve although certain
improvements can be gained for the next order by optimally designing the pattern
matrix.
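The gap between i.i.d. patterns and the bound can be made concrete by comparing the inverse moment of the spectrum with the inverse of its mean. The following sketch assumes Gaussian i.i.d. entries with $\alpha > 1$ (so that $\kappa = 1$); the helper `curve_coefficient` is a hypothetical name for the prefactor of the $\alpha^{-2}$ term.

```python
import numpy as np

def curve_coefficient(kappa, mean_lam, mean_inv_lam_nonzero):
    """Prefactor 1.926 kappa^2 <lambda> <1/lambda>~ of the 1/alpha^2 term."""
    return 1.926 * kappa**2 * mean_lam * mean_inv_lam_nonzero

rng = np.random.default_rng(4)
N, alpha = 300, 4.0                       # illustrative sizes with alpha > 1, so kappa = 1
p = int(alpha * N)
X = rng.standard_normal((p, N)) / np.sqrt(N)
lam = np.linalg.eigvalsh(X.T @ X)         # all N eigenvalues are positive when p > N

mean_lam = lam.mean()
mean_inv = (1.0 / lam).mean()
coef_iid = curve_coefficient(1.0, mean_lam, mean_inv)
coef_bound = curve_coefficient(1.0, mean_lam, 1.0 / mean_lam)   # equality case of Jensen
```

For i.i.d. patterns `coef_iid` exceeds `coef_bound` by a factor close to $\alpha/(\alpha - 1)$, which tends to one as $\alpha$ grows; this is the sense in which the difference only shows up at the next order.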



4.3. Presumed optimal performance in the non-asymptotic region

The above argument characterizes the optimal learning performance of simple
perceptrons in the asymptotic region $\alpha \gg 1$. On the other hand, in information theory,
it is known that when $w$ is transformed to $y$ via $y = Xw + n$, where $n$ is a noise vector
whose components are i.i.d. Gaussian random numbers, $X$ which is characterized by

$$\rho(\lambda) = \begin{cases} (1 - \alpha) \delta(\lambda) + \alpha \delta(\lambda - 1), & (0 < \alpha < 1), \\ \delta(\lambda - \alpha), & (\alpha > 1), \end{cases} \tag{32}$$

maximizes the mutual information between $w$ and $y$, $\forall \alpha > 0$ under the condition that
the power of each column in $X$ is equally constrained to $\alpha$ [30]. For $0 < \alpha < 1$, this
spectrum can be realized by composing $X$ of $p = N\alpha$ randomly chosen orthonormal row
vectors. On the other hand, for $\alpha > 1$, patterns of randomly constructed orthogonal
column vectors of dimension $N\alpha$ and length $\sqrt{\alpha}$ satisfy equation (32). A set of row
vectors of such pattern matrices is sometimes referred to as Welch bound equality (WBE)
sequences [31]. Notice that these sequences achieve the upper-bound of equation (31) in
the asymptotic region since $\langle \lambda^{-1} \rangle = \langle \lambda \rangle^{-1}$ holds. This and the formal similarity between
the channel problem and perceptron learning imply that pattern matrices characterized
by equation (32) may maximize the learning performance of perceptrons $\forall \alpha > 0$ as
well. Therefore, as the final example, we analyze the case of equation (32), utilizing the
methodology of equation (16) in the non-asymptotic region of $\alpha \sim O(1)$ and compare
its learning performance to that of the i.i.d. patterns to investigate optimality.
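The two constructions described above (orthonormal rows for $0 < \alpha < 1$, orthogonal columns of length $\sqrt{\alpha}$ for $\alpha > 1$) can be sketched with a QR decomposition of a Gaussian matrix. This is one possible way to randomize the orthogonal frame and is an assumption of this example; the function name `popt_pattern_matrix` is hypothetical.

```python
import numpy as np

def popt_pattern_matrix(N, p, rng):
    """p x N matrix with orthonormal rows (p <= N) or orthogonal columns of length sqrt(p/N)."""
    if p <= N:
        q, _ = np.linalg.qr(rng.standard_normal((N, p)))   # N x p, orthonormal columns
        return q.T                                          # rows are orthonormal
    q, _ = np.linalg.qr(rng.standard_normal((p, N)))       # p x N, orthonormal columns
    return np.sqrt(p / N) * q                               # columns of length sqrt(alpha)

rng = np.random.default_rng(5)
N = 8
X_low = popt_pattern_matrix(N, 4, rng)     # alpha = 0.5
X_high = popt_pattern_matrix(N, 16, rng)   # alpha = 2.0
lam_low = np.sort(np.linalg.eigvalsh(X_low.T @ X_low))
lam_high = np.linalg.eigvalsh(X_high.T @ X_high)
```

For `X_low` a fraction $1 - \alpha$ of the eigenvalues vanishes and the rest equal one, while for `X_high` all eigenvalues equal $\alpha$, in line with the two branches of the spectrum.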

For spherical weights, the saddle point problem of equation (16) can be analytically
solved for equation (32) for $0 < \alpha < 1$, yielding the solution $q_w = 2\alpha/\pi$, $q_u = 2/\pi$,
which in turn implies that

$$\Phi = \alpha \log 2. \tag{33}$$

For $\alpha > 1$, analytical construction of the solution is difficult and we resorted to a
numerical method. Figures 1 (a) and (b) show a comparison of the learning performance
between the (presumed optimal) case of equation (32) and the i.i.d. patterns of equation
(21). For both the teacher-student overlap $q_w$ (figure 1 (a)) and mutual information (free
energy) $\Phi$ (figure 1 (b)), equation (32) results in better learning performance than that
of the i.i.d. patterns over the entire region of $\alpha > 0$.

Figure 1. Performance curves for spherical weights. (a) $q_w$ vs. $\alpha$. (b) $\Phi$ vs. $\alpha$.
"POPT" and "i.i.d." denote data for the presumed optimal and i.i.d. patterns,
respectively. Similarly for other figures.

As another representative learning model, we examined the case of binary (Ising)
weights, the results of which are shown in figures 2 (a) and (b). For the i.i.d. patterns,
it is known that simple perceptrons with binary weights exhibit perfect learning at
$\alpha = \alpha_c^{\mathrm{i.i.d.}} \simeq 1.245$, completely identifying the teacher network [7]. Such behavior is
also observed for the case of equation (32) at a certain critical ratio $\alpha_c^{\mathrm{POPT}}$, which is
characterized by the vanishing entropy condition $S = S_0 - \Phi = \log 2 - \Phi = 0$. Figure 2
(b) yields $\alpha_c^{\mathrm{POPT}} \simeq 1.101$ for equation (32), implying a better learning performance than
that of the i.i.d. patterns. In terms of $q_w$, patterns of equation (32) are also superior
to the i.i.d. patterns (figure 2 (a)). In both figures 2 (a) and (b), numerical data for
$N = 100$ systems obtained from $10^4$ experiments based on a Thouless-Anderson-Palmer
type mean field method [32] are in very close agreement with the saddle point
solutions of equation (16) (curves), which justifies the methodology based on equation
(16).

Figure 2. (a) $q_w$ vs. $\alpha$. (b) $S$ vs. $\alpha$. The curves represent the theoretical prediction
evaluated by the replica method. The markers are obtained from $10^4$ experiments for
$N = 100$ systems utilizing a Thouless-Anderson-Palmer type mean field method [15].

The superiority of equation (32) is also confirmed by experimental assessment of
$\alpha_c^{\mathrm{POPT}}$. Figure 3 shows the result of exhaustive search experiments for small systems
$(6 \le N \le 19)$. In order to characterize the critical ratio for finite systems, for each
pair of $N$ and $p$, we estimated the probability $r(N, p)$ that at least one weight vector $w$
that differs from the teacher vector $w_0$ is completely compatible with a given reference
data set $\xi^p$ utilizing 10^ experiments. For each $N$, the critical ratio is defined as
$\alpha_c^{\mathrm{POPT}}(N) = N^{-1} \sum_{p \le p_{\max}} r(N, p)$, where $p_{\max}$ is a sufficiently large threshold value to
truncate the summation. We set $p_{\max} = 4N$. $\alpha_c^{\mathrm{POPT}}(N)$ is expected to converge to
$\alpha_c^{\mathrm{POPT}}$ as $N$ tends to infinity. In figure 3, the data plotted versus $1/N$ are asymmetric
either side of a peak, which implies that it is necessary to use a higher order polynomial
for estimating $\lim_{N \to \infty} \alpha_c^{\mathrm{POPT}}(N)$ by extrapolation. Therefore, we fitted a fourth degree
polynomial, which is supported by minimization of the leave-one-out cross validation
error. The value $\lim_{N \to \infty} \alpha_c^{\mathrm{POPT}}(N) \simeq 1.111$ assessed by extrapolation agrees closely
with the theoretical estimate $\alpha_c^{\mathrm{POPT}} \simeq 1.101$. This is considerably smaller than the
counterpart of the i.i.d. patterns, $\lim_{N \to \infty} \alpha_c^{\mathrm{i.i.d.}}(N) \simeq 1.245$, the theoretical value of
which is $\alpha_c^{\mathrm{i.i.d.}} \simeq 1.245$, indicating the superiority of equation (32).
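The exhaustive search procedure can be sketched for a very small system as follows. This is an illustration under stated assumptions rather than the authors' code: patterns are i.i.d. Gaussian (not the orthogonalized ones), the sum over $p$ starts at $p = 0$ (the lower limit is an assumption of this sketch), and the trial count is far smaller than in the reported experiments. The function name `estimate_critical_ratio` is hypothetical.

```python
import itertools
import numpy as np

def estimate_critical_ratio(N, trials, rng):
    """Estimate alpha_c(N) = N^{-1} * sum_p r(N, p) for i.i.d. Gaussian patterns."""
    p_max = 4 * N
    candidates = np.array(list(itertools.product([-1, 1], repeat=N)))  # all 2^N binary weights
    ambiguous = np.zeros(p_max + 1)        # index p = number of patterns presented so far
    for _ in range(trials):
        w0 = candidates[rng.integers(len(candidates))]   # random binary teacher
        alive = np.ones(len(candidates), dtype=bool)     # version space, teacher included
        for p in range(p_max + 1):
            if p > 0:
                x = rng.standard_normal(N)
                alive &= np.sign(candidates @ x) == np.sign(w0 @ x)
            # More than one survivor means some w != w0 is still compatible.
            ambiguous[p] += alive.sum() > 1
    r = ambiguous / trials
    return r.sum() / N, r

rng = np.random.default_rng(8)
alpha_c_6, r_6 = estimate_critical_ratio(N=6, trials=100, rng=rng)
```

Since the version space can only shrink as patterns are added, the estimated $r(N, p)$ is non-increasing in $p$, which is why truncating the sum at a sufficiently large $p_{\max}$ is harmless.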

Figure 3. Critical ratio of perfect learning estimated by exhaustive search
experiments, plotted versus $1/N$. Data (markers) are obtained by 10^ experiments for
systems of $N = 6, \ldots, 19$. We fitted a fourth degree polynomial with respect to $1/N$, which is
supported by a model selection scheme based on the leave-one-out cross validation, for
assessing the values as $N \to \infty$. Assessed values are $\lim_{N \to \infty} \alpha_c^{\mathrm{POPT}}(N) \simeq 1.111$ and
$\lim_{N \to \infty} \alpha_c^{\mathrm{i.i.d.}}(N) \simeq 1.245$ for the presumably optimal and i.i.d. patterns, respectively.
These are in very close agreement with the theoretical predictions $\alpha_c^{\mathrm{POPT}} \simeq 1.101$ and
$\alpha_c^{\mathrm{i.i.d.}} \simeq 1.245$.


In conclusion, the above analyses for spherical and binary weights indicate that the 
eigenvalue spectrum of equation (32) always yields a better learning performance than
that of the i.i.d. patterns. This lends some support to our conjecture that equation (32)
achieves optimal learning performance for perceptrons operating under a fixed power 
constraint on the pattern matrix. 
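For $0 < \alpha < 1$ the analytical solution $q_w = 2\alpha/\pi$, $\Phi = \alpha \log 2$ admits a simple heuristic reading, sketched below as a Monte Carlo check. This interpretation is an assumption of the example and not an argument made in the paper: with orthonormal patterns each label reveals only the sign of one independent, approximately Gaussian coordinate of the teacher, so a posterior sample agrees with the teacher in sign along those $p$ directions and is uncorrelated with it elsewhere.

```python
import numpy as np

rng = np.random.default_rng(6)
N, alpha = 20000, 0.5                    # illustrative sizes with 0 < alpha < 1
p = int(alpha * N)

# Teacher fields along the p orthonormal pattern directions; for a spherical teacher
# with |w0|^2 = N these are treated as independent standard Gaussians (large-N assumption).
h_teacher = rng.standard_normal(p)
y = np.sign(h_teacher)

# A student drawn from the posterior only knows the signs: its fields are half-normal
# with sign y, and its components outside the pattern span are uncorrelated with the teacher.
h_student = y * np.abs(rng.standard_normal(p))

q_w_estimate = float(h_teacher @ h_student) / N
q_w_theory = 2.0 * alpha / np.pi
info_per_component = alpha * np.log(2.0)  # each label carries log 2 nats
```

In this picture every label contributes exactly one bit, which is consistent with a mutual information per component that grows linearly in $\alpha$ until the orthonormal directions are exhausted.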

5. Summary 

In summary, we have investigated the learning performance of simple perceptrons 
extending a methodology for handling correlated patterns developed in [14, 15] to a
teacher-student scenario. The scheme allows us to characterize various second-order 
correlations among the input patterns by an eigenvalue spectrum of the cross-correlation 
matrix under an assumption that the right and left eigen-bases of the pattern matrix are 
independently generated from the Haar measure. Using this characterization, we have 
offered a general formula that relates the eigenvalue spectrum to the average free energy, 
which, in the current context, is a measure of the mutual information between the weight 
vector and output labels, given a typical pattern matrix. The formula is used to examine
cases for which column or row vectors in the pattern matrix are orthogonalized under a 
fixed power constraint, the learning performance of which is optimal for the asymptotic 
region and presumed to be optimal in general. Results from numerical experiments 
based on a Thouless-Anderson-Palmer type mean field method and exhaustive search 
examinations for small systems are in agreement with theoretical predictions obtained 
from the formula. 

A mathematical proof of the optimality of the eigenvalue spectrum (32) and
applications of the current scheme to various problems in learning and communication 
are promising future research directions. 